Linked Data Registry: A New Approach To Technical Registries

Authors

  • Maïté Braud
  • James Carr
  • Kevin Leroux
  • Joseph Rogers
  • Robert Sharpe
Abstract

Technical Registries are used in digital preservation to enable organizations to maintain definitions of the formats, format properties, software, migration pathways etc. needed to preserve content over the long term. There have been a number of initiatives to produce technical registries leading to the development of, for example, PRONOM, UDFR and the Planets Core Registry. However, these have all been subject to some criticisms. One problem is that either the information model is fixed and difficult to evolve or flexible but hard for users to understand. However, the main problem is the governance of the information in the registry. This has often been restricted to the host organization, which may have limitations on the investment they can make. This restriction has meant that, whilst other organizations have, perhaps, been free to use the registry they have been unable to add to or edit the information within it. The hosts of the registries have generally been receptive to requests for additions and change but this has still led to issues with timing or when different organizations cannot agree (or just utilise or interpret things in different ways). In this paper we describe a new approach, which has used linked data technology to create the Linked Data Registry (LDR). This approach means it is simple to extend the data model and to link to other sources that provide a more rounded description of an entity. In addition, every effort has been made to ensure there is a simple user interface so that users can easily find and understand the information contained in the registry. This paper describes what is believed to be the first linked data technical registry that can be deployed widely. The key element of the new approach is the distributed maintenance model which is designed to resolve the governance problem. Any organization hosting an LDR instance is free to add and edit content and to extend the model. 
If an instance of LDR is exposed on the internet, then any other organization is free to retrieve this additional information and hold it in its own LDR instance, alongside locally maintained information and information retrieved from other sources. This means a peer-to-peer network is established where each registry instance in the network chooses which other registry instances to trust and thereby from whom to receive which content. This gives control to each individual organization, since they are not dependent on anyone else but can choose to take different content from appropriate authoritative sources. At the same time it allows collaboration to reduce the administrative burden associated with the maintenance of all of the information.

Introduction

Role of Technical Registries

One of the key threats to the preservation of digital material is that “Users may be unable to understand or use the data, e.g., the semantics, format, processes or algorithms involved” (Kuipers, 2009). This issue is addressed in the OAIS model through the development of Representation Information networks (CCSDS, 2012). Some Representation Information might be specific to a given Information Object (e.g., data from a one-off experiment might need to record information related to the instrument calibration and quality control that took place) or it might apply very commonly (e.g., the need to understand the specification of PDF/A). This means that Representation Information networks will consist of some information maintained locally (to hold information specific to the Information Objects held in that repository) and some information that is probably best maintained remotely from the repository (or at least can be maintained more efficiently that way, e.g., not every organization using PDF/A needs to be an expert in the details of its specification).

Maïté Braud, Pauline Sinclair, and Robert Sharpe. Girona 2014: Arxius i Industries Culturals.

The need for Representation Information networks is well established in data-holding institutions. This is because, for example, data gathering often utilises new combinations of techniques, methods and algorithms. In order to be able to understand the results, a repository therefore needs to be able to reference information related to these, yet does not necessarily want to repeat this information with every data set.

In memory institutions the problem has traditionally been handled in different ways using different terminology, but conceptually it is the same approach. For example, such institutions usually create a catalogue entry to describe (at least at a high level) each record they hold. This catalogue entry, as well as describing information specifically about the record, may reference other information (e.g., a description of the collection to which the record belongs, or links to other controlled sources such as organizations, people or events related to the record). These controlled sources are then described in turn (externally to the individual catalogue entry) and may, themselves, reference another external source. This creates a network of information that helps a user to understand the semantics of a record.

For example, imagine a genealogist looking at the history of an ancestor. From the records of a national archive, they might be able to find out that their ancestor was in the army and served in a given regiment between two dates. The national archive might maintain a separate list of information about every regiment in the national army but it might not contain detailed information about each regiment, such as where that regiment was posted on a given date. However, this information might be available from a regimental museum.
Hence, a given user (with sufficient knowledge and skill) can find out where their ancestor was posted on a given date through the use of a network of representation information that will involve information held with the record, information explicitly linked to the record and information implicitly linked to the record.

For memory institutions, this sort of network applies to paper records as well as digital records, and such networks have been in existence for some time. The advent of digital technology has made catalogues of information easier to maintain, more accessible, easier to search and easier to link to each other, but the fundamental information storage and retrieval process has not changed.

However, the advent of digital information has led to new problems, such as the need to remain able to interpret, for example, a file of a specific format that constitutes all or part of the original record. To solve this problem various attempts have been made to add such information to the existing, relevant representation information networks. This has included the development of ‘Technical Registries’, which are designed to be repositories of key facts about things that are important to the environment needed to interpret digital records and/or the environment needed to preserve such records. There have been a number of high profile attempts to create such a registry including PRONOM (http://www.nationalarchives.gov.uk/PRONOM/Default.aspx), UDFR (http://www.udfr.org) and the Planets Core Registry (http://www.openplanetsfoundation.org/planets-core-registry). These registries have provided significant advantages and at least some of them are in regular use.
PRONOM, for example, is used as the basis for the format signatures that underpin the widely used file format identification tool, DROID (http://digital-preservation.github.io/droid), while the Planets Core Registry has been used as the basis for automated characterisation and migration decisions within Preservica’s (part of the Tessella group) digital preservation systems: Preservica EE (formerly known as SDB) and Preservica CE (http://preservica.com). Other initiatives such as the "Solve the File Format Problem" (http://fileformats.archiveteam.org/) or the Community Owned digital Preservation Tool Registry (COPTR) (http://coptr.digipres.org/) have already demonstrated the benefit of using crowdsourcing to collate information relevant to the Digital Preservation community, but these repositories do not offer machine-to-machine interfaces and thus are aimed mainly at researchers or manual curation.

Limitations of current registries

However, all of these registry initiatives have also been subject to two main criticisms. The first is that the set of entities modelled, the properties held about such entities and their relationship to other entities has been hard to expand and/or hard to interact with. Either of these issues makes it hard to integrate this information as part of a representation information network. For example, it would be desirable to be able to link a locally-held record about a format to, say, its formal specification. In some existing registries this could be done by, say, uploading a copy to the Technical Registry, but that copy would not be updated if an error were found in the specification and corrected on, say, the official website.

There have been two contrasting approaches to this issue of expandability and usability. The first has been to use a fixed-schema database with a user interface intricately linked to that schema.
This approach (used in PRONOM and the Planets Core Registry) makes the system easy to use but hard to expand. The alternative approach (used in UDFR) has been to use linked data, which is easier to expand. However, linked data is a technology designed for computer-to-computer interactions, meaning that it can be hard for non-technical users to interact with the information. UDFR has made some effort to create a user interface to help with this, but arguably it is harder to use the software to find information than, for example, in the fixed-schema, harder-to-expand PRONOM system. The issue has already been raised in previous papers, and initiatives such as the P2-Registry (Tarrant, 2011) recognised and proved the benefit of the Linked Data approach while highlighting that exposing SPARQL query interfaces directly to end users might be too complex for a lot of people to use.

The second issue is one of governance of the information. Since these registries have been used by organizations other than their hosts, there have been issues about what to do when information is incomplete, in error or possibly just a matter of opinion. For example, some organizations have wanted to extend the range of formats that is covered by PRONOM. The UK National Archives (the hosts of PRONOM) have been as proactive as possible at supporting such requests, but the need for them to go through appropriate checks and their limited resources mean that it can take some time before a request leads to a registry update. In addition, there have also been cases where there have been disagreements within the community about format definitions, and cases where an information update has changed existing behaviour, causing systems that relied on the previous behaviour to stop working as expected.

New Approach

This paper describes a new type of Technical Registry designed to solve these problems: the Linked Data Registry (LDR).
Like UDFR it uses linked data technology (http://linkeddata.org/), which allows flexible linking of resources to other resources, thereby offering a solution to the expandability part of the first issue. In addition, the registry aims to make searching, viewing and editing entities as easy as in a fixed-schema system. This means it also offers a solution to the usability part of the first issue.

Searches of linked data systems use a search language called SPARQL that is conceptually similar to the structured query language (SQL) used by more traditional relational databases. In many linked data systems a SPARQL endpoint is considered sufficient to allow for searching, viewing and editing of content. However, the users of a Registry should not be assumed to be sufficiently technically savvy to write queries using SPARQL or to be able to interpret the raw results, any more than users of a traditional relational database would be expected to write SQL statements or interpret the raw results this would produce. Creating a method of allowing searching, viewing and editing of linked-data information in a manner that is natural to non-technical users is a non-trivial issue that has been the subject of considerable research effort (Davies, 2010). In this paper we describe how we have attempted to solve this problem. It is inevitably a design compromise but one that we believe is optimized to balance expandability and ease of use.

Crucially, LDR also addresses the issue of governance. It allows a network of registries to be created that can be replicated peer-to-peer, thereby removing the need for any organization to be dependent on any other for the maintenance of information, unless it chooses to be so.

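The trust-based, peer-to-peer idea can be illustrated with a minimal Python sketch. It is not LDR's actual replication mechanism or API: the class `RegistryInstance`, its methods `trust`, `pull_from` and `all_statements`, and the instance names are all hypothetical, and registry content is reduced to plain (subject, predicate, object) tuples. The only point it tries to show is that each instance decides for itself which peers it accepts content from.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple

# A statement is reduced to a (subject, predicate, object) tuple of strings.
Statement = Tuple[str, str, str]


@dataclass
class RegistryInstance:
    """Hypothetical stand-in for one registry instance (not LDR's real API)."""
    name: str
    # Statements maintained by this organization itself.
    local: Set[Statement] = field(default_factory=set)
    # Statements copied from peers, keyed by the peer they came from.
    replicated: Dict[str, Set[Statement]] = field(default_factory=dict)
    # Names of the peers this instance has chosen to trust.
    trusted: Set[str] = field(default_factory=set)

    def trust(self, peer_name: str) -> None:
        self.trusted.add(peer_name)

    def pull_from(self, peer: "RegistryInstance") -> bool:
        """Copy a peer's locally maintained statements, only if it is trusted."""
        if peer.name not in self.trusted:
            return False
        self.replicated[peer.name] = set(peer.local)
        return True

    def all_statements(self) -> Set[Statement]:
        """Local content plus everything replicated from trusted peers."""
        merged = set(self.local)
        for statements in self.replicated.values():
            merged |= statements
        return merged


# Three illustrative instances; only "library" is trusted by "archive".
archive = RegistryInstance("archive", local={("format/A", "has version", "1.02")})
library = RegistryInstance("library", local={("format/A", "has extension", "JPG")})
vendor = RegistryInstance("vendor", local={("format/A", "has extension", "JPEG")})

archive.trust("library")
accepted = archive.pull_from(library)   # trusted peer: content is copied
rejected = archive.pull_from(vendor)    # untrusted peer: nothing is copied
```

Keeping replicated statements keyed by their source is a design choice of this sketch: it leaves the provenance of each statement visible, so an instance could later stop trusting a peer and drop only that peer's content.
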
Linked Data

Linked data is becoming a more commonly used technology but some readers may be unfamiliar with it or unclear what terms such as resource, subject, predicate and object mean. This section provides a very brief introduction, which should be sufficient to understand the rest of this paper.

A resource is the linked data term for an entity; examples include file format, software and migration pathway. A resource needs to be identified uniquely by a URI (Uniform Resource Identifier). A resource is described by a set of statements (expressed as subject predicate object). Statements can be either simple or complex:

  • A simple statement is a statement where the object is of a simple type: e.g., a String or an Integer, but crucially not another resource.
  • A complex statement is a statement where the object is another resource.

For example:

  • “Resource A” “has MIME type” “image/jpeg”
  • “Resource A” “has PUID” “fmt/44”
  • “Resource A” “has extension” “JPEG”
  • “Resource A” “has extension” “JPG”
  • “Resource A” “has version” “1.02”

are all simple statements in the form subject predicate object that describe and identify resource A (aka JPEG file format v1.02).

Resource A “has internal signature” Resource B (where resource A is a file format and resource B is a DROID internal signature) is an example of a complex statement. In this case the DROID internal signature object will itself be an agglomeration of statements that define and describe it.

Information Modelled

In this first version of LDR the information modelled needed to be sufficient to allow efficient (and automated) preservation-related activities to take place. However, after meeting this sufficiency criterion, the data model has been minimised deliberately. This was done partly to keep the problem tractable but also partly based on the experience of developing the Planets Core Registry.
In that project we found that there was a wish to expand the data model to include every attribute that might possibly be needed in the future. This was understandable since the technology used (a relational database with a fixed graphical user interface) meant that it was hard to expand the system after it was initially completed. However, this meant in practice that large tracts of the data model were left unpopulated. Perhaps worse was that it was not clear whether the lack of information meant that the data model was not useful, the information was not valuable enough to be collected, the information was too hard to collect, or it had simply not been collected yet.

Hence, in this study, it was decided to use a technology that was much easier to expand (linked data) and to start out by only modelling the information that was known to be of interest (essentially the entities that were populated in the Planets Core Registry). These entities could be split into two classes: factual information (information that could reasonably be expected to be held in common by lots of agencies without controversy) and policy information (information about what to do when, which might be relevant to only one repository). In LDR these two classes of information are held separately, but still linked. It should be emphasized that this is not a hard-and-fast distinction: just a pragmatic one. Hence, it is possible for organizations to disagree about information (such as the exact definition of a format) while it is also possible for organizations to share policies. The use of a peer-to-peer network (see the “Replication” section) allows both of these cases to be covered.

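The subject-predicate-object statements from the Linked Data section, and the separation of factual from policy information described above, can be sketched together in a few lines of Python. This is only an illustration under stated assumptions: the `Resource` and `Statement` classes, the `describe` helper, the example.org URIs and the two policy predicates ("applies to format", "preferred action") are hypothetical and are not taken from LDR's data model. The factual statements reuse the JPEG example given in the text.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Resource:
    """An entity identified uniquely by a URI."""
    uri: str


@dataclass(frozen=True)
class Statement:
    subject: Resource
    predicate: str
    # A literal object gives a simple statement; a Resource gives a complex one.
    obj: Union[Resource, str, int]

    def is_complex(self) -> bool:
        return isinstance(self.obj, Resource)


# Illustrative URIs only; they are not identifiers used by LDR.
jpeg = Resource("http://example.org/format/jpeg-1.02")
signature = Resource("http://example.org/signature/jpeg-1.02")
policy = Resource("http://example.org/policy/local-image-policy")

# Factual information: the statements from the JPEG example.
factual: List[Statement] = [
    Statement(jpeg, "has MIME type", "image/jpeg"),
    Statement(jpeg, "has PUID", "fmt/44"),
    Statement(jpeg, "has extension", "JPEG"),
    Statement(jpeg, "has extension", "JPG"),
    Statement(jpeg, "has version", "1.02"),
    Statement(jpeg, "has internal signature", signature),  # complex statement
]

# Policy information: held in a separate list, but linked to the factual
# resource through its URI. Both predicates here are hypothetical.
policies: List[Statement] = [
    Statement(policy, "applies to format", jpeg),
    Statement(policy, "preferred action", "migrate"),
]


def describe(resource: Resource, statements: List[Statement]) -> List[Statement]:
    """Return every statement whose subject is the given resource."""
    return [s for s in statements if s.subject == resource]


extensions = [s.obj for s in describe(jpeg, factual) if s.predicate == "has extension"]
complex_count = sum(1 for s in factual if s.is_complex())
```

Because the policy statement points at the format's URI rather than copying its description, two organizations could in principle share the factual list while keeping different policy lists, which is the pragmatic split the text describes.
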
Factual Information

The Linked Data Registry models a number of key factual entities aggregated into five groups:

  • File formats (with associated DROID internal signature and byte sequences)
  • Software
  • Related software tools (including the tool’s purpose and parameters)
  • Migration pathways
  • Properties and property groups

The decision to create these five groups of entities was based on how these entities are used by users. For example, a user would naturally view, create or edit information about a format and then expect to add or create an internal signature for that format. Linked data concepts mean that this relationship could be considered the other way around (i.e. internal signatures are associated with formats), especially given that a single internal signature is often associated with multiple formats. However, humans tend to look up the signatures associated with formats more often than the other way round and would tend to add new signatures based on information derived from a format’s specification.

This aggregation is important for the user interface needed to interact with the system (see the “Search, View, and Edit” section below). It is less important from a technical perspective, since the resources can safely be considered to be linked to each other from any perspective. The impact of this aggregation on the expandability of the model is discussed in the “Expansion” section below. Each of these five groups of entities is discussed in turn.

Format Information

This entity group models file formats, including internal signatures and the byte sequences of internal signatures. It is based on the model established by the UK National Archives as part of their Linked Data PRONOM research project (http://labs.nationalarchives.gov.uk/wordpress/index.php/2011/01/linked-data-and-pronom).

Attribute Repeatable? Link to other Resource


Publication date: 2014